Abstract. Assume that in a sample of size $M$ one finds $M_i$ representatives of species $i$, with $i = 1 \ldots N^*$. The normalized frequency $p_i^* = M_i/M$, based on the finite sample, may deviate considerably from the true probability $p_i$. We propose a method to infer the rank-ordered true probabilities $r_i$ from the measured frequencies $M_i$. We show that the rank-ordered probabilities provide important information on the system, e.g., the true number of species and the Shannon and Rényi entropies.

1 Introduction

In experimental work one frequently faces the problem of determining the probabilities of occurrence (or concentrations) $p_1, p_2, \ldots, p_N$ of species $1, 2, \ldots, N$. The probability of species $i$ is defined by

$$ p_i = \lim_{M \to \infty} \frac{M_i}{M}, \qquad i = 1 \ldots N \tag{1} $$

$$ \sum_{i=1}^{N} p_i = 1 \tag{2} $$



with Mi being the number of representatives of the species 
i found in a sample of size M . N is the number of dif- 
ferent species which will appear in a sample of infinite 
size. Of course M will never be infinite in reality, but a 
number which is determined mainly by the experimental 
effort, i.e., usually costs and time. In this article the term 
"species" is not used in its strict phylogenetic sense, but 
it stands as a synonym for "distinguishable events which 
are members of a statistical ensemble" . 

A prominent example where we cannot reliably infer the probabilities from counted frequencies is the distribution of words of length $n$ in nucleotide sequences such as DNA. Since the alphabet has four letters, $4^n$ words can in principle be constructed. Even for moderate values of $n$, the number of possible words exceeds the size of any available database. To compute the entropy of the word distribution in biosequences we therefore have to apply correction methods, e.g. [1,2,3,4,5,6,7].

Virtually every experimental measurement of concentrations (or probabilities) is affected by finite-size effects due to the feasible number $M$ of samples which can be investigated. In a real measurement one cannot even expect to find the correct number $N$ of species. Instead, in general, a smaller number $N^*$ is found, depending on the sample size $M$. We will show that, even if $M$ is rather large, the deviations of the observed relative frequencies

$$ p_i^* = \frac{M_i}{M}, \qquad M \text{ finite} \tag{3} $$

from the true probabilities $p_i$ as defined by Eq. (1) may be significant. A method to deduce true probabilities from measured relative frequencies is, therefore, highly desirable.

The aim of this article is to propose a method to correct the relative frequencies $p_i^* = M_i/M$ in a finite sample of size $M$ so as to approximate the true probabilities $p_i$ which would be obtained if a sample of infinite size were investigated. Our method is based on the idea that estimating rank-ordered probabilities is easier by orders of magnitude than estimating species-ordered probabilities. This is because the rank-ordering procedure ignores the exact relation between the species number $i$ and the probability $p_i$. It then remains to estimate the shape of a function $r_i$ which decreases monotonically with the rank $i$ (see Sec. 2 for the definition of $r_i$). The large interest in rank-ordered distributions is based on the fact that several characteristic quantities, such as the Rényi entropies [8]

$$ H^{(q)} = \frac{1}{1-q} \log \sum_{i=1}^{N} p_i^{\,q} \tag{4} $$

are invariant with respect to the ordering. Therefore, the rank-ordered distribution suffices to compute the Rényi entropies. We note the important relations $N = \exp H^{(0)}$ and $H = H^{(1)}$ (the latter understood as the limit $q \to 1$). In other words, the true number of species $N$ as well as the Shannon entropy $H$ can be calculated exactly from the rank-ordered distribution $r_i$. In the last section we will show how our method can also be extended to estimate any mean value of statistical variables.

2 Species ordered distributions and 
rank-ordered distributions 

Assume we draw a sample from a system of $N$ different species which are equally distributed, $p_1 = p_2 = \cdots = p_N = 1/N$. Symbols with upper index $*$, such as $p_i^*$, denote quantities observed in a finite sample of size $M$. Obviously, if $M$ is large enough the relative frequencies approach the probabilities, $p_1^* \to p_1$, $p_2^* \to p_2$, $\ldots$, $p_N^* \to p_N$, due to Eq. (1).

Figure 1 shows the observed relative frequencies for three different sample sizes $M$ for 1000 equidistributed species with $p_1 = p_2 = \cdots = p_{1000} = 1/1000$. For this figure we produced uniformly distributed random integers from the interval [0, 999] and counted the occurrences of each number. As expected, with increasing sample size $M$ the distribution increasingly resembles the equidistribution, in agreement with the true probabilities $p_i$. Nevertheless, the deviations of the relative frequencies from the probabilities are significant: even for the rather large relative sample size $M/N = 1000$ (Fig. 1, bottom) the ratio of relative frequency to probability can be as large as $p_i^*/p_i \approx 1.11$. For $M/N = 1$ (top of the figure) many species have not been found even once.
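The numerical experiment described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy's default random generator stands in for the random numbers used for the figure; the function name `observed_frequencies` and the seed are hypothetical choices, not taken from the text.

```python
import numpy as np

def observed_frequencies(n_species, sample_size, rng):
    """Draw `sample_size` uniform integers in [0, n_species) and return p_i^* = M_i / M."""
    draws = rng.integers(0, n_species, size=sample_size)
    counts = np.bincount(draws, minlength=n_species)  # M_i for every species i
    return counts / sample_size

rng = np.random.default_rng(seed=1)  # the seed is an arbitrary choice
N = 1000
for M in (10**3, 10**6):             # M/N = 1 and M/N = 1000
    p_star = observed_frequencies(N, M, rng)
    unseen = np.count_nonzero(p_star == 0)
    print(f"M/N = {M // N}: unseen species = {unseen}, max p*/p = {p_star.max() * N:.3f}")
```

The two printed quantities correspond to the two effects discussed in the text: the number of species missed entirely at small $M/N$, and the largest ratio $p_i^*/p_i$ at large $M/N$.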

To quantify the deviations, and for practical purposes that will be motivated below, we will use the data representation given in Fig. 2. Here the same data as in Fig. 1 are displayed; however, the abscissa does not show the species label $i$. Instead, the species are ordered according to the frequency of their occurrence in the sample: the species which occurs with the largest number of representatives appears at the first position (1) of the abscissa, the species found with the second largest frequency is labeled 2, etc. We call this representation the rank-ordered distribution of frequencies, where $r_i^*$ is the observed relative frequency of the species at rank $i$.
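Rank ordering amounts to sorting the counts in descending order and discarding the species labels. A minimal sketch, with a hypothetical helper `rank_ordered` and made-up counts for five species:

```python
import numpy as np

def rank_ordered(counts):
    """Return relative frequencies sorted from most to least frequent (rank 1 first)."""
    counts = np.asarray(counts, dtype=float)
    return np.sort(counts)[::-1] / counts.sum()

# Hypothetical counts M_i for five species, given in species order:
r_star = rank_ordered([3, 0, 7, 1, 3])
print(r_star)
```

Any permutation of the input counts gives the same output, which is the loss of the species-to-probability assignment discussed in this section.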

Figure 2 clearly reveals that even for large relative sample size $M/N$ the observed rank-ordered frequencies $r_i^*$ may deviate considerably from the probabilities $p_i$. For the smaller sample size $M/N = 1$ (top of the figure) about 1/3 of the species are not found even once, i.e., the observed number of species may be smaller than the true number, $N^* < N$.

The rank-ordered relative frequencies $r_i^*$ form, by definition, always a non-increasing function of the rank. In the limit $M \to \infty$ this function approaches the rank-ordered probabilities $r_i$, which coincide with $p_i$ for the equidistribution as well as whenever the probabilities $p_i$ decay with increasing label number $i$ of the species (see examples in the following sections). As mentioned, this limit is difficult to achieve when $N$ is large. For the intermediate sample size (Fig. 2, middle) we find a distribution that is far from uniform. Even for $M = 10^6$ the inset shows a rank-ordered distribution which deviates considerably from the equidistribution. Hence, from an observation one might erroneously conclude that the events are not equally distributed.

Fig. 1. Observed relative frequencies of $N = 1000$ equidistributed species found in samples of three sizes $M$, increasing from $M = 10^3$ (top, $M/N = 1$) to $M = 10^6$ (bottom, $M/N = 1000$).

The rank-ordered probabilities $r_i$ ($i$ is the rank index) contain less information than the species-ordered distribution $p_i$, since the assignment species $\leftrightarrow$ probability is lost. Inferring the rank-ordered probabilities $r_i$ from a sample of size $M$ is, therefore, a much simpler problem than inferring the species-related probabilities $p_i$. In general, the rank-ordered distribution contains about $N!$ times less information than the species-number-ordered distribution $p_i$, since about $N!$ species-ordered distributions correspond to the same rank-ordered distribution:

$$ \{ r_i,\ i = 1, \ldots, N \} \;\leftrightarrow\; \left\{ \begin{array}{l} p_1, p_2, p_3, \ldots, p_N \\ p_2, p_1, p_3, \ldots, p_N \\ p_3, p_2, p_1, \ldots, p_N \\ \quad \vdots \\ p_j,\ j = 1, \ldots, N,\ \{j\} = \mathrm{perm}\{i\} \end{array} \right. \tag{5} $$

Fig. 2. Rank-ordered observed relative frequencies of $N = 1000$ equidistributed species in samples of three sizes $M$, increasing from $M = 10^3$ (top) to $M = 10^6$ (bottom). The insets show the same data with higher resolution.

More precisely, the number of species-number-ordered distributions is slightly smaller than the number $N!$ of permutations of the species numbers, since there might be species which occur with the same probability, so that their permutation does not affect the distribution.

From these arguments we conclude that it is about $N!$ times simpler to infer the rank-ordered probabilities $r_i$ from a sample of size $M$ than the species-ordered probabilities $p_i$. In other words, a sample of size $M$ allows one to determine the rank-ordered probabilities with much higher accuracy than the species-ordered probabilities.

Before coming to the main point, the estimate of prob- 
abilities from finite sample observations, it is helpful first 
to consider the inverse problem. 



3 Predicting observed relative frequencies 
from a probability distribution 

3.1 Equidistributed species 

To describe the species-ordered observed relative frequencies $\{p_i^*, i = 1, \ldots, N\}$, in general $N - 1$ numbers are required. For the corresponding rank-ordered relative frequencies $\{r_i^*\}$ it is sufficient to specify how many species did not appear in our sample (this quantity will be denoted by $k_0$), how many species occurred with one representative ($k_1$), how many with two representatives ($k_2$), etc. An observed rank-ordered distribution of relative frequencies is, hence, determined by a set of occupation numbers $\{k_i, i = 0, 1, 2, \ldots, M\}$. In this section we describe a method to predict the observed rank-ordered relative frequencies $r_i^*$ from a probability distribution $p_i$ for finite sample size $M$.
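The occupation numbers can be read off a sample by counting the counts. The sketch below uses illustrative values of $N$ and $M$ and a hypothetical helper `cluster_sizes`; it also shows that the set $\{k_i\}$ alone reproduces the rank-ordered frequencies. Two natural sanity checks are that the $k_i$ add up to the number of species and that $\sum_i i\,k_i$ equals the sample size.

```python
import numpy as np

def cluster_sizes(counts, sample_size):
    """k[i] = number of species seen exactly i times, for i = 0..M."""
    return np.bincount(np.asarray(counts), minlength=sample_size + 1)

rng = np.random.default_rng(seed=2)   # arbitrary seed
N, M = 1000, 3000                     # illustrative values, not taken from the text
counts = np.bincount(rng.integers(0, N, size=M), minlength=N)  # M_i per species
k = cluster_sizes(counts, M)

# The set {k_i} fixes the rank-ordered frequencies: k[i] species sit at value i/M.
r_star = np.repeat(np.arange(M + 1)[::-1], k[::-1]) / M
```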

The observed distribution $\{r_i^*\}$ is characterized by the cluster distribution $\{k_j\}$: the number of species that appear with $j$ representatives each in a sample of size $M$. We define the probability distribution $p_c(k_i, i)$ to find exactly $k_i$ species each occurring with precisely $i$ representatives in a sample of size $M$. With the normalization conditions

$$ \sum_{i=0}^{M} k_i = N \quad \text{(total number of species)} \tag{6} $$

$$ \sum_{i=0}^{M} i\, k_i = M \quad \text{(total number of individuals)}, \tag{7} $$

for the case of equidistributed species this distribution reads (see also [10,11,13])

$$ p_c(k_i, i) = \frac{N!\, M!}{N^M\, k_i!} \sum_{j=k_i}^{\lfloor M/i \rfloor} \frac{(-1)^{j-k_i}\, (N-j)^{(M-ji)}}{(i!)^j\, (j-k_i)!\, (N-j)!\, (M-ji)!} \tag{8} $$

where $\lfloor x \rfloor$ denotes the integer part of $x$ and the sum is restricted to $j \le N$.
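The distribution of the number of species that occur exactly $i$ times can be checked with exact rational arithmetic for small systems. The sketch below is an assumption-laden illustration: it derives the distribution by inclusion-exclusion over subsets of $j$ species for $N$ equiprobable species, which may differ in form from the expression printed in the paper, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def cluster_probability(k, i, N, M):
    """P(exactly k of N equiprobable species occur exactly i times in M draws)."""
    j_max = N if i == 0 else min(N, M // i)
    total = Fraction(0)
    for j in range(k, j_max + 1):
        # S_j: summed probability that j chosen species each occur exactly i times
        ways = factorial(M) // (factorial(i) ** j * factorial(M - j * i))
        s_j = Fraction(comb(N, j) * ways * (N - j) ** (M - j * i), N ** M)
        total += (-1) ** (j - k) * comb(j, k) * s_j
    return total

def mean_cluster_size(i, N, M):
    """First moment of the distribution above."""
    return sum(k * cluster_probability(k, i, N, M) for k in range(N + 1))

N, M = 4, 6                                   # small illustrative system
mean_direct = [mean_cluster_size(i, N, M) for i in range(M + 1)]
```

For such small $N$ and $M$ the first moment can be compared directly with the binomial expression $N \binom{M}{i} (1/N)^i (1 - 1/N)^{M-i}$.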

The observable $\langle k_i \rangle$, i.e., the average number of species that occur with $i$ representatives when a sample of size $M$ is drawn, is the first moment of this probability distribution [10,11], $\langle k_i \rangle = \sum_{k_i} k_i\, p_c(k_i, i)$, where the summation is to be performed over all cluster distributions which are in agreement with Eqs. (6) and (7):

$$ \langle k_i \rangle = N \binom{M}{i} \left( \frac{1}{N} \right)^{i} \left( 1 - \frac{1}{N} \right)^{(M-i)} \tag{9} $$



The occupation numbers $k_i$, $i = 0, 1, 2, \ldots$, are called the $i$-clusters; $\langle k_0 \rangle$ is then the average size of the cluster of species which do not appear in our sample, $\langle k_1 \rangle$ is the size of the cluster of species which appear with one representative, etc.

Obviously, for small $M \ll N$, almost all of the $N$ species which could in principle be found belong to the 0-cluster, i.e., they do not appear in a sample of size $M$. As $M$ increases the number of single occupations $\langle k_1 \rangle$ increases as well; consequently $\langle k_0 \rangle$ decays. For still growing $M$ the number of multiple occupations becomes larger and, therefore, the sizes of the 0-cluster and the 1-cluster decrease. Figure 3 shows the sizes of the first clusters, $\langle k_0 \rangle$ to $\langle k_5 \rangle$, as a function of the sample size $M$. The lines show the theoretical result Eq. (9), and the symbols in the top of Fig. 3 show the clusters as they have been found in numerical simulations using equally distributed random numbers.
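The comparison between the binomial expectation for equidistributed species and a simulation can be sketched as follows. The expectation used is $\langle k_i \rangle = N \binom{M}{i} (1/N)^i (1 - 1/N)^{M-i}$, evaluated in log space to avoid overflow; the sample size, the number of repeats and the function names are illustrative assumptions, and $N > 1$ is required.

```python
import math
import numpy as np

def expected_cluster_size_uniform(i, N, M):
    """<k_i> = N * C(M, i) * (1/N)**i * (1 - 1/N)**(M - i), evaluated in log space."""
    log_binom = math.lgamma(M + 1) - math.lgamma(i + 1) - math.lgamma(M - i + 1)
    log_term = log_binom - i * math.log(N) + (M - i) * math.log1p(-1.0 / N)
    return N * math.exp(log_term)

def simulated_cluster_sizes(N, M, i_max, repeats, rng):
    """Average k_0..k_i_max over `repeats` independent samples of size M."""
    totals = np.zeros(i_max + 1)
    for _ in range(repeats):
        counts = np.bincount(rng.integers(0, N, size=M), minlength=N)
        totals += np.bincount(counts, minlength=i_max + 1)[: i_max + 1]
    return totals / repeats

N, M = 1000, 2000                      # M is an illustrative choice
rng = np.random.default_rng(seed=3)    # arbitrary seed
theory = [expected_cluster_size_uniform(i, N, M) for i in range(6)]
simulated = simulated_cluster_sizes(N, M, i_max=5, repeats=50, rng=rng)
```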
Fig. 3. Expectation values $\langle k_i \rangle$ of the cluster sizes $i = 0, \ldots, 5$ as a function of the sample size $M$ taken from a set of $N = 1000$ equidistributed species. The lines show the theoretical result Eq. (9). The symbols show the cluster sizes found by numerical simulations. The lower figure shows the same data for a larger range of the sample size $M$.



As a special case, $\langle k_0 \rangle$ allows one to determine the number of different species $N^*$ which are expected to be found in a sample of size $M$. This number is given by the total number of species $N$ minus the number of species which we expect to find with zero representatives:

$$ N^* = N - \langle k_0 \rangle \tag{10} $$

i.e.

$$ N^* = N \left[ 1 - \left( 1 - \frac{1}{N} \right)^{M} \right] \tag{11} $$

Fig. 4. Number of species $N^*$ found in a sample of size $M$ if $N = 1000$ species all occur with the same probability $p_i = 1/1000$. The dashed line shows the analytical result Eq. (11); the impulses show the results of a computer simulation. (Abscissa: sample size $M$.)



Figure 4 shows the corresponding simulation results for $N = 1000$. For sample size $M = 5000$ the average number of species found is $N^* \approx 993$, i.e., on average about 7 species are not found. For $M = 8000$ the average number of species found is $N^* \approx 999.67$; here we can be optimistic to have found at least one representative of all species. For practical purposes it may be useful to note that even for rather small values of $N$, Eq. (11) can be approximated with very good accuracy in the entire range of $M$ by

$$ N^{*\,\mathrm{approx}} = N \left[ 1 - \exp\left( -\frac{M}{N} \right) \right] \tag{12} $$

The maximal absolute deviation is $N^* - N^{*\,\mathrm{approx}} = 1/e \approx 0.37$ (for $N = 1$), which falls rapidly to $1/(2e) \approx 0.18$ as $N$ goes to infinity.
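The expected number of observed species and its exponential approximation are simple enough to evaluate directly. A minimal sketch, assuming the forms $N[1 - (1 - 1/N)^M]$ and $N[1 - \exp(-M/N)]$; the function names are hypothetical.

```python
import math

def expected_species_found(N, M):
    """N * (1 - (1 - 1/N)**M): expected number of distinct species in M draws."""
    return N * (1.0 - (1.0 - 1.0 / N) ** M)

def expected_species_found_approx(N, M):
    """Exponential approximation: N * (1 - exp(-M/N))."""
    return N * (1.0 - math.exp(-M / N))

for M in (5000, 8000):
    exact = expected_species_found(1000, M)
    approx = expected_species_found_approx(1000, M)
    print(M, round(exact, 2), round(approx, 2))
```

Scanning $M$ for fixed $N$ and recording the largest difference between the two functions is a quick way to reproduce the deviation bounds quoted in the text.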

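Using Eq. (9) for the expectation values $\langle k_i \rangle$ we obtain directly the observed rank-ordered distribution of relative frequencies [12,13]:

$$ r_i^* = \begin{cases} 0 & \text{for } N \ge i > N - \langle k_0 \rangle \\ 1/M & \text{for } N - \langle k_0 \rangle \ge i > N - \langle k_0 \rangle - \langle k_1 \rangle \\ \;\vdots & \\ j/M & \text{for } N - \sum_{s=0}^{j-1} \langle k_s \rangle \ge i > N - \sum_{s=0}^{j} \langle k_s \rangle \\ \;\vdots & \end{cases} \tag{13} $$

Using Stirling's formula to expand the expressions in Eq. (9), the analytical result Eq. (13) can easily be written in elementary functions.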
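Turning expected cluster sizes into a predicted rank-ordered curve is a cumulative-sum lookup: the species at rank $i$ is assigned the value $j/M$ for the smallest $j$ whose cumulative cluster size $\sum_{s \le j} \langle k_s \rangle$ exceeds $N - i$. The sketch below evaluates this step function at integer ranks only, which is a simplification; the function names and the values of $N$ and $M$ are illustrative assumptions.

```python
import math
import numpy as np

def expected_clusters_uniform(N, M):
    """<k_s> for s = 0..M and N equiprobable species (binomial expectation)."""
    s = np.arange(M + 1)
    log_fact = np.array([math.lgamma(x + 1.0) for x in range(M + 1)])
    log_binom = log_fact[M] - log_fact[s] - log_fact[M - s]
    return N * np.exp(log_binom - s * math.log(N) + (M - s) * math.log1p(-1.0 / N))

def predicted_rank_frequencies(k_mean, N, M):
    """Step function: rank i gets j/M when N - cum[j-1] >= i > N - cum[j]."""
    cum = np.cumsum(k_mean)                       # cum[j] = sum_{s<=j} <k_s>
    ranks = np.arange(1, N + 1)
    j = np.searchsorted(cum, N - ranks, side="right")
    return np.minimum(j, M) / M                   # guard against round-off at the top

N, M = 1000, 10_000                               # illustrative values
r_pred = predicted_rank_frequencies(expected_clusters_uniform(N, M), N, M)
```

Because the boundaries between the plateaus are generally not integers, the sum of `r_pred` over integer ranks is only approximately one.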

Figure 5 shows rank-ordered relative frequencies calculated from a sample of random numbers (dashed lines) together with the theoretical distributions due to Eq. (13) (solid lines). (To plot more than one curve in the same figure we show the absolute frequencies $M r_i^*$, i.e., the absolute numbers of occurrences of the species at rank $i$ in a pool of size $M$.) The combinatorial theory sketched above predicts with good accuracy the rank-ordered relative frequencies which result from an equiprobability distribution.

Fig. 5. Numerically determined rank-ordered relative frequencies $r_i^*$ for samples of size $M$ (dashed lines). The $N = 1000$ species occur with equal probability, $p_i = 1/1000$. The theoretical curves due to Eq. (13) are drawn with solid lines. (Abscissa: rank number $i$.)

The figure demonstrates that we are indeed able to predict analytically the rank-ordered observed frequencies which appear when a sample of size $M$ is drawn from $N$ equally distributed species. We may turn the question around and ask: which sample size is required to make sure that at least $N^*$ out of $N$ species are observed? The required sample size is, to good approximation,

$$ M = N \log \frac{N}{N - N^*} = -N \log C. \tag{14} $$

Here $C = (N - N^*)/N$ is the fraction of species which we admit not to be represented in the sample, and $\log$ denotes the natural logarithm. If we admit that about 5% are not represented we find, e.g., $M \approx 3N$ (since $-\log 0.05 \approx 3.0$), i.e., the sample should be about three times as large as the estimated species number. This estimate may be important for the planning of observations of nearly equally probable species.
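For planning purposes the sample-size estimate can be tabulated for a few admitted fractions $C$ of unrepresented species. A minimal sketch, assuming $M = -N \ln C$ with the natural logarithm; the function name and the listed values of $C$ are illustrative.

```python
import math

def required_sample_size(N, unseen_fraction):
    """M = -N * ln(C), with C the admitted fraction of unrepresented species."""
    if not 0.0 < unseen_fraction < 1.0:
        raise ValueError("unseen_fraction must lie strictly between 0 and 1")
    return -N * math.log(unseen_fraction)

N = 1000
for C in (0.135, 0.05, 0.01):     # illustrative choices of C
    print(f"C = {C}: M / N = {required_sample_size(N, C) / N:.2f}")
```

Inserting the result back into $N[1 - \exp(-M/N)]$ returns $N(1 - C)$, which is a convenient consistency check.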

3.2 General distributions 

3.2.1 An alternative motivation of Eq. (9)

The derivation of Eq. (9) for the case of equidistributed species requires some algebra, and in this work we will not derive a corresponding equation for general distributions. For the equidistribution the order of the rank-ordered empirical frequencies always corresponds to the order of the true probabilities since all true probabilities are identical, i.e., reordering the true probabilities always leads to the equidistribution. This is different in the general case: due to fluctuations it may happen that $p_i^* < p_j^*$ although $p_i > p_j$. The probability for two species to change ranks in the empirical distribution depends on the difference $p_i - p_j$ (the larger the difference, the less probable it is that they change ranks due to fluctuations) and on the sample size (the larger the sample size, the smaller the fluctuations and, hence, the smaller the probability to change ranks). A comprehensive calculation must take these exchange probabilities into account.

Nevertheless, we wish to present a hypothesis which can be checked by numerical simulations. We will demonstrate that although the theoretical derivation of $\langle k_i \rangle$ for the case of a non-uniform probability distribution is somewhat simplified, the predicted results agree well with numerics.

Let us discuss an alternative motivation of Eq. (9): Assume there are $N$ species occurring with the same probability $p_1 = p_2 = \cdots = p_N = 1/N$. The probability to find exactly $i$ representatives of species $j$ in a community of $M$ individuals is given by the binomial distribution

$$ P_j(i) = \binom{M}{i}\, p_j^{\,i}\, (1 - p_j)^{M-i}. \tag{15} $$

The probability to find any species exactly $i$ times in a community of size $M$ is the probability of the union of the events: species 1 appearing $i$ times, species 2 appearing $i$ times, etc. Since these events do not exclude each other for $i \le M/2$, one cannot directly sum the probabilities. Instead, one has to apply the inclusion-exclusion principle [5] to subtract the intersection probabilities, which in fact has been done to derive Eq. (8), see [10,11]. Let us see what happens if we ignore the intersection probabilities: the expectation value for the number of species which appear in a sample of size $M$ with exactly $i$ representatives then reads

$$ \langle k_i \rangle = \sum_{j=1}^{N} P_j(i) = N \binom{M}{i} \left( \frac{1}{N} \right)^{i} \left( 1 - \frac{1}{N} \right)^{M-i}, \tag{16} $$

which is identical with Eq. (9). We want to point out again that this derivation of the first moments is incomplete, but it yields the correct result. In contrast to the exact derivation, this simple motivation for the equidistribution has the great advantage that it can be generalized to the case of an arbitrary distribution. In general, according to Eqs. (15) and (16), the expectation value for the number of species which appear in a sample of size $M$ exactly $i$ times is

$$ \langle k_i \rangle = \sum_{j=1}^{N} \binom{M}{i}\, p_j^{\,i}\, (1 - p_j)^{M-i}. \tag{17} $$

It can easily be checked that this distribution has the correct normalization imposed by Eqs. (6) and (7).

Having always in mind that we have no rigorous proof for the correctness of this result yet, we want to check its validity by numerical simulations.
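A numerical check of this kind can be sketched for an arbitrary distribution. The code below evaluates $\langle k_i \rangle = \sum_j \binom{M}{i} p_j^i (1 - p_j)^{M-i}$ in log space and compares it with a Monte Carlo average; the probabilities, the sample size and the function names are hypothetical, and all $p_j$ are assumed to lie strictly between 0 and 1.

```python
import math
import numpy as np

def expected_clusters(p, M):
    """<k_i> = sum_j C(M, i) p_j**i (1 - p_j)**(M - i) for i = 0..M (needs 0 < p_j < 1)."""
    p = np.asarray(p, dtype=float)
    i = np.arange(M + 1)[:, None]                        # cluster index, as a column
    log_fact = np.array([math.lgamma(x + 1.0) for x in range(M + 1)])
    log_binom = (log_fact[M] - log_fact - log_fact[::-1])[:, None]
    log_terms = log_binom + i * np.log(p) + (M - i) * np.log1p(-p)
    return np.exp(log_terms).sum(axis=1)

def simulated_clusters(p, M, repeats, rng):
    """Monte Carlo average of the occupation numbers k_0..k_M."""
    totals = np.zeros(M + 1)
    for _ in range(repeats):
        counts = rng.multinomial(M, p)                   # M_j for every species j
        totals += np.bincount(counts, minlength=M + 1)
    return totals / repeats

p = np.array([0.4, 0.3, 0.2, 0.05, 0.05])                # hypothetical probabilities
M = 20
k_theory = expected_clusters(p, M)
k_mc = simulated_clusters(p, M, repeats=20_000, rng=np.random.default_rng(4))
```

The two normalization conditions can be verified on `k_theory` directly: its entries sum to the number of species, and the entries weighted by their index sum to $M$.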



3.2.2 Example: step-wise equidistribution 

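We wish to demonstrate the application of Eq. (17) using a step-wise equidistribution of $N = 102$ species. Let us assume for the probabilities:

$$ p_i = \begin{cases} p_\alpha = 15/(8N) & \text{for } 1 \le i \le N/3 \\ p_\beta = 6/(8N) & \text{for } N/3 < i \le 2N/3 \\ p_\gamma = 3/(8N) & \text{for } 2N/3 < i \le N. \end{cases} \tag{18} $$

The normalization can be checked easily: $\sum_{i=1}^{N} p_i = 1$. For these probabilities of the species we obtain from Eq. (17)

$$ \langle k_i \rangle = \frac{N}{3} \binom{M}{i} \left[ p_\alpha^{\,i} (1 - p_\alpha)^{M-i} + p_\beta^{\,i} (1 - p_\beta)^{M-i} + p_\gamma^{\,i} (1 - p_\gamma)^{M-i} \right]. \tag{19} $$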



The expected rank-ordered empirical relative frequencies $r_i^*$ can be found from Eq. (13) in the same way as previously: $\langle k_0 \rangle$ is the number of species which on average will not be found in a sample of size $M$, $\langle k_1 \rangle$ is the number of species which will appear with one representative, $\langle k_2 \rangle$ species appear with two representatives each, etc., and finally $\langle k_M \rangle$ is the number of species which are expected to be found with $M$ representatives. Obviously, no species can appear with more than $M$ representatives since our sample is of size $M$. To generate the rank-ordered observed relative frequencies we notice that on average these values jump from $(i+1)/M$ to $i/M$ at the rank positions $N - \sum_{s=0}^{i} \langle k_s \rangle$. Hence, the expected empirical relative frequencies are

$$ r_i^* = \begin{cases} 0 & \text{for } N \ge i > N - \langle k_0 \rangle \\ 1/M & \text{for } N - \langle k_0 \rangle \ge i > N - \langle k_0 \rangle - \langle k_1 \rangle \\ 2/M & \text{for } N - \langle k_0 \rangle - \langle k_1 \rangle \ge i > N - \langle k_0 \rangle - \langle k_1 \rangle - \langle k_2 \rangle \\ \;\vdots & \\ 1 & \text{for } N - \sum_{s=0}^{M-1} \langle k_s \rangle \ge i > 0. \end{cases} \tag{20} $$

Note that, in general, the average cluster sizes $\langle k_s \rangle$ and, therefore, the rank positions at which $r_i^*$ jumps are not integers. To check this formula, in Fig. 6 we show the true probability distribution due to Eq. (18) (dashed lines), the prediction of the observed relative frequencies due to Eq. (20) (solid lines), and the results of a Monte Carlo simulation (circles), where the data have been averaged over 100 independent drawings of random numbers. The analytical and numerical results agree with good accuracy.
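The Monte Carlo side of this comparison can be sketched as follows for the step-wise distribution with plateau weights 15 : 6 : 3 and $N = 102$. The sample sizes are those listed in the caption of Fig. 6 and the 100 repeats follow the text; the function names and the seed are hypothetical, and NumPy's multinomial sampler is an assumed stand-in for the random numbers used by the authors.

```python
import numpy as np

def stepwise_probabilities(N):
    """Three equally wide plateaus with weights 15 : 6 : 3, normalised to sum to 1."""
    if N % 3 != 0:
        raise ValueError("N must be divisible by 3")
    third = N // 3
    return np.repeat([15.0, 6.0, 3.0], third) / (8.0 * N)

def averaged_rank_frequencies(p, M, repeats, rng):
    """Rank-order each sample first, then average over the repeats."""
    total = np.zeros(len(p))
    for _ in range(repeats):
        counts = rng.multinomial(M, p)
        total += np.sort(counts)[::-1] / M
    return total / repeats

p = stepwise_probabilities(102)
rng = np.random.default_rng(seed=5)            # arbitrary seed
r_avg = {M: averaged_rank_frequencies(p, M, repeats=100, rng=rng)
         for M in (100, 500, 5000, 50_000)}
```

Sorting inside the loop matters: averaging the unsorted counts would simply converge to the true probabilities and hide the finite-sample distortion of the rank-ordered curve.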



3.2.3 Example: exponential distribution 

As a second example we wish to check the validity of Eq. (17) by means of a (shifted) exponential probability distribution

$$ r_i = p_i = \frac{1 - \exp(-a)}{1 - \exp(-aN)}\, \exp(-a i) \tag{21} $$

with $0 \le i < N$, i.e., $\sum_{i=0}^{N-1} p_i = 1$. From Eq. (17) we obtain

$$ \langle k_i \rangle = \frac{M!}{i!\,(M-i)!} \sum_{j=0}^{N-1} p_j^{\,i}\, (1 - p_j)^{M-i}, \tag{22} $$

with $p_j$ taken from Eq. (21).

Figure 7 shows the theoretical predictions together with results of numerical simulations. Again, theory agrees well with the numerical results. Due to the excellent agreement of Eq. (22) with numerics we conclude that Eq. (17) allows one to predict the observed relative frequencies, provided the true probability distribution is given.



4 Inferring probabilities from experiments 

4.1 How to determine probabilities? 

In a strict sense, probabilities cannot be determined by experiments for an obvious reason: even for a fair die it is mathematically possible, although not very probable, to cast the die 1000 times and to obtain the six 1000 times. From such a measurement, of course, one would hardly conclude that the die is fair, i.e., that the sides one to six appear with equal probability. Hence, we have to require that the measurement is representative. The strict definition of this term is not easy since for $M = 10$ both measured sequences, 5-5-3-4-2-6-5-6-1-3 and 6-6-6-6-6-6-6-6-6-6, occur with equal probability. The great advantage of working with rank-ordered measurements is that the order of the measured sequence is irrelevant, i.e., the measurement 5-5-3-4-2-6-5-6-1-3 would lead to the same measured rank-ordered frequencies as 2-6-6-3-4-5-5-5-3-1 or 5-5-5-3-3-6-6-1-2-4. In this sense, a rank-ordered measurement from a sequence 5-5-3-4-2-6-5-6-1-3 represents many more possible measured configurations than 6-6-6-6-6-6-6-6-6-6. Similar to the derivation of the canonical distribution in statistical mechanics, we will call a measurement representative if there are many equivalent measurements (permutations) which all belong to the same rank-ordered sequence.

Fig. 6. Rank-ordered relative frequencies of step-wise equidistributed species due to Eq. (18) for sample sizes $M = 50{,}000$, $M = 5{,}000$, $M = 500$, and $M = 100$ (top to bottom). The dashed lines show the probabilities due to Eq. (18), the full lines show the predicted relative frequencies due to Eq. (20), and the circles show the results of a Monte Carlo simulation.

Fig. 7. Rank-ordered relative frequencies for exponentially distributed species as defined by Eq. (21) with $a = 0.05$, $N = 100$ and $M = 200$ (top), $a = 0.05$, $N = 200$, $M = 100$ (middle), $a = 1.0$, $N = 100$, $M = 500$ (bottom). The theoretical results (solid lines) are due to Eqs. (20) and (22), the circles show the distribution of a single set of $M$ random numbers from the interval $[1 \ldots N]$ drawn according to the distribution Eq. (21), and the dashed lines show averages over 100 such experiments.

In a strict mathematical sense we have to repeat the experiment of drawing a sample of size $M$ an infinite number of times in order to get an averaged and representative set of rank-ordered frequencies. If, however, we had all these measurements, the method presented in this article would turn out to be meaningless, since for an infinite set of measurements the observed frequencies approximate the true probabilities, see Eq. (1). Following the same argumentation, in order to measure the pressure of air in a room we would also need an infinite set of measurements, since there is a non-zero probability (although never observed under common conditions) that all air molecules are located in one half of the room, so that our manometer would show double the pressure or zero, depending on which half of the room is populated. Therefore, there is not much difference between measuring the pressure of air and inferring probabilities from measurements: in both cases one relies on the fact that a representative measurement is, by definition, a very probable one.
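The notion of a representative measurement can be made quantitative for the die example by counting how many ordered sequences share a given rank-ordered profile. The sketch below is an illustration under the assumption of a six-sided die and ten casts; the helper names are hypothetical.

```python
from collections import Counter
from math import factorial

def rank_profile(sequence, n_faces=6):
    """Counts of the faces, sorted from most to least frequent (zeros included)."""
    counts = Counter(sequence)
    return tuple(sorted((counts.get(f, 0) for f in range(1, n_faces + 1)), reverse=True))

def sequences_with_profile(profile):
    """Number of ordered sequences that share the given rank-ordered counts."""
    orderings = factorial(sum(profile))
    for c in profile:
        orderings //= factorial(c)             # multinomial coefficient
    labelings = factorial(len(profile))
    for multiplicity in Counter(profile).values():
        labelings //= factorial(multiplicity)  # faces with equal counts are interchangeable
    return orderings * labelings

mixed = rank_profile([5, 5, 3, 4, 2, 6, 5, 6, 1, 3])
sixes = rank_profile([6] * 10)
print(mixed, sequences_with_profile(mixed))
print(sixes, sequences_with_profile(sixes))
```

The mixed sequence belongs to a profile shared by several million sequences, whereas the profile of the ten sixes is shared by only six sequences (one per face), which is the sense in which the former is far more representative.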



4.2 Optimization of cluster distributions 

Equation (17) allows one to predict in a systematic way the expectation values of the cluster sizes $\langle k_i \rangle$, $i = 0 \ldots M$, provided the probabilities $p_i$, $i = 1 \ldots N$, are known. Note that the average cluster sizes $\langle k_i \rangle$ based on the species-ordered probabilities $p_i$ are identical with those based on the rank-ordered probabilities $r_i$. We can then say that Eq. (17) states the relation between the observed rank-ordered frequencies $r_i^*$ and the rank-ordered probabilities $r_i$. This relation permits one to infer the rank-ordered probabilities from a set of observed frequencies $r_i^*$.

In this section we propose a variational method to estimate the distribution $\{r_i, i = 1, \ldots, N\}$ from data of a measurement.

Consider a set of experimentally determined cluster sizes $k_i^{\rm exp}$, $i = 1 \ldots M$. This set can be determined by counting how many species in a sample of size $M$ appeared with one individual in the sample ($k_1^{\rm exp}$), with two individuals ($k_2^{\rm exp}$), etc. We assume further that the set of experimentally determined relative frequencies (and, therefore, cluster sizes) is representative in the sense discussed in Section 4.1, i.e., we assume that there exists an (unknown) probability distribution which leads to the averaged cluster sizes $\langle k_1 \rangle \approx k_1^{\rm exp}$, $\langle k_2 \rangle \approx k_2^{\rm exp}$, etc.

Equation (17), establishing the relation between the probabilities $r_i$ and the averaged cluster sizes $\langle k_i \rangle$, allows one to construct a variational scheme. This is done by constructing the dimensionless objective function

$$ \psi_{(k)} = \sum_{i=0}^{M} \ldots \tag{23} $$

with $\epsilon$ being a small number, and requiring that it is minimal for the (unknown) set of probabilities $r_i$. The index $(k)$ of $\psi_{(k)}$ indicates that the objective function refers to the cluster distribution.

Starting from a trial initial set of rank-ordered probabilities $r_i$, e.g. the equidistribution, and an initial trial number of species $N$, e.g. the observed number of species $N^*$ (implying that $k_0^{\rm exp} = 0$), the probabilities can be approximated numerically by a gradient method, Eq. (24), which shifts each $r_i$ against the gradient $\partial \psi_{(k)} / \partial r_i$. Using Eqs. (17) and (15) we obtain

$$ \frac{\partial \psi_{(k)}}{\partial r_i} = \sum_{j=0}^{M} \frac{\partial \psi_{(k)}}{\partial \langle k_j \rangle}\, \frac{\partial \langle k_j \rangle}{\partial r_i}. \tag{25} $$

The probabilities modified in this way have to be normalized,

$$ \sum_{i=1}^{N} r_i = 1. \tag{26} $$

Equation (24) and the subsequent normalization have to be applied until convergence of $\psi_{(k)}(N)$ is achieved. Of course, the initial value $N$ might not be the true number of different species, i.e., on top of the optimization for fixed $N$ we have to optimize the value of $N$ itself. This can be done by performing a sequence of minimizations for different values of $N$ ranging from the observed value $N^*$ up to some $N_{\max}$. The result of this set of minimizations is the function $\psi_{(k)}(N)$, which has to be minimal for the most probable value of $N$.
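The factor $\partial \langle k_j \rangle / \partial r_i$ that any gradient scheme of this kind needs follows from differentiating the binomial expression for the expected cluster sizes term by term. A short worked derivation, assuming $\langle k_j \rangle = \sum_s \binom{M}{j} r_s^{\,j} (1 - r_s)^{M-j}$ and treating the $r_s$ as independent variables (the normalization being restored afterwards), reads:

```latex
\frac{\partial \langle k_j \rangle}{\partial r_i}
  = \binom{M}{j} \left[ j\, r_i^{\,j-1} (1 - r_i)^{M-j}
      - (M - j)\, r_i^{\,j} (1 - r_i)^{M-j-1} \right]
  = \binom{M}{j}\, r_i^{\,j-1} (1 - r_i)^{M-j-1} \left( j - M r_i \right)
```

The derivative changes sign at $r_i = j/M$, the probability at which a species contributes most to the $j$-cluster.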

For several examples we have been able to determine the probabilities with good accuracy. This method has, however, a drawback: for rather large sample size $M$, when the observed relative frequencies approximate the probabilities, we would expect that it becomes simpler to infer probabilities from observed frequencies. Instead, for increasing $M$ it becomes more and more difficult since the cluster sizes $\langle k_i \rangle$ become small. This can be seen, e.g., from Fig. 2: in the upper figure for $M = 10^3$ the typical size of the clusters is $k_i \sim 100$, whereas in the lower figure drawn for $M = 10^6$ the typical size is $k_i \sim 1$. Therefore, the larger the sample size, the larger become the fluctuations of the measured cluster sizes $k_i^{\rm exp}$, and the expression in Eq. (24) becomes ill defined. Considering that a typical cluster size is given by the number of observed different species $N^*$ divided by the sample size $M$, the described method is useful when $N^*/M > 1$. In this case it yields reliable results.



4.3 Direct optimization of the probabilities 

To overcome the mentioned problem, the second proposed algorithm deals directly with the rank-ordered distribution $r_i$ instead of the cluster sizes $k_i$. This method is very similar to a Monte Carlo simulation in which the function to minimize is the deviation between the predicted rank-ordered frequencies and the experimentally observed frequencies.

Given an experimentally determined set of frequencies $M_i^{\rm exp}$ ($i = 1 \ldots N^{\rm exp}$), e.g., $M_1^{\rm exp} = 25$, $M_2^{\rm exp} = 15$, $M_3^{\rm exp} = 110$, etc., with $N^{\rm exp}$ being the observed number of different species in the sample, the following algorithm determines approximately the probabilities:



1. Determine the total number of individuals in the sample, $M = \sum_{i=1}^{N^{\rm exp}} M_i^{\rm exp}$.

2. Order the frequencies according to their rank, i.e., $r_1^{\rm exp} = 110$, $r_2^{\rm exp} = 25$, $r_3^{\rm exp} = 15$, etc.

3. Determine a trial initial value of the total number of species $N$, for example by means of Eq. (11), i.e., determine the initial value of $N$ with the assumption that the (unknown) probabilities are identical.

4. Initialize the trial rank-ordered probabilities $r_i$ which are to be determined, for example with $r_i = 1/N$.

5. Predict the rank-ordered observed relative frequencies $r_i^*$, $i = 1, \ldots, N$, which are expected to be found when drawing a sample of $M$ individuals according to the trial probabilities. This can be done by two different methods, either by

(i) calculating the expected cluster distribution due to Eq. (17) and then the rank-ordered frequencies via Eq. (20),

(ii) or by the following procedure:

(a) draw $M$ random numbers from the interval $[1, N]$ with probabilities $r_i$ using a Metropolis algorithm,

(b) count the occurrences of the numbers $1, 2, \ldots, N$ and sort these frequencies according to their rank,

(c) repeat steps (a) and (b) a number of times, e.g. 10, and average the rank-ordered distributions. (Note that it is essential first to rank-order and then to average.)

6. Determine the deviation of the experimentally observed rank-ordered frequencies $\{r_i^{\rm exp}\}$ and the predicted rank-ordered frequencies $\{r_i^*\}$,

$$ \psi_{(r)} = \sum_{i=1}^{N} \left| r_i^{\rm exp} - r_i^* \right|. \tag{27} $$

The index $(r)$ indicates that $\psi_{(r)}$ is computed based on the frequencies $r_i$.^1

7. Modify the probabilities $r_i$ in order to minimize $\psi_{(r)}$.

8. Proceed with item 5 until the deviation $\psi_{(r)}$ is sufficiently small or until no further progress can be achieved.
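The eight steps can be sketched compactly. The code below is a simplified illustration and not the authors' implementation: it predicts the frequencies with variant (i) of step 5 (binomial cluster expectations followed by the step function, evaluated at integer ranks only), keeps the number of species $N$ fixed, and uses a multiplicative random move as one possible realization of step 7. All function names, the move size `scale`, and the test distribution are assumptions.

```python
import math
import numpy as np

def predict_rank_frequencies(r, M, log_binom):
    """Step 5(i): expected cluster sizes from the binomial sum, then the step function."""
    i = np.arange(M + 1)[:, None]
    k_mean = np.exp(log_binom[:, None] + i * np.log(r) + (M - i) * np.log1p(-r)).sum(axis=1)
    cum = np.cumsum(k_mean)                          # cum[j] = sum_{s<=j} <k_s>
    N = len(r)
    j = np.searchsorted(cum, N - np.arange(1, N + 1), side="right")
    return np.minimum(j, M) / M

def deviation(r, r_exp, M, log_binom):
    """Step 6: psi_(r) = sum_i |r_exp_i - r*_i|."""
    return np.abs(r_exp - predict_rank_frequencies(r, M, log_binom)).sum()

def infer_rank_probabilities(counts, N, steps, rng, scale=0.2):
    """Steps 1-8 for a fixed number of species N >= 2 (N is not varied in this sketch)."""
    counts = np.sort(np.asarray(counts))[::-1]       # step 2: rank-order
    M = int(counts.sum())                            # step 1
    r_exp = np.zeros(N)
    r_exp[: len(counts)] = counts / M                # unseen species get frequency 0
    log_fact = np.array([math.lgamma(x + 1.0) for x in range(M + 1)])
    log_binom = log_fact[M] - log_fact - log_fact[::-1]
    r = np.full(N, 1.0 / N)                          # step 4
    psi = deviation(r, r_exp, M, log_binom)
    for _ in range(steps):                           # steps 5-8
        trial = r.copy()
        idx = rng.integers(N)
        trial[idx] *= math.exp(scale * rng.normal()) # random multiplicative move
        trial = np.sort(trial / trial.sum())[::-1]   # normalise, keep rank order
        psi_trial = deviation(trial, r_exp, M, log_binom)
        if psi_trial < psi:                          # accept only improvements
            r, psi = trial, psi_trial
    return r, psi

rng = np.random.default_rng(seed=6)                  # arbitrary seed
p_true = np.repeat([15.0, 6.0, 3.0], 10) / (8.0 * 30)  # step-wise test case with N = 30
sample = rng.multinomial(600, p_true)
observed = sample[sample > 0]                        # only observed species are known
r_fit, psi_fit = infer_rank_probabilities(observed, N=30, steps=400, rng=rng)
```

Since a move is kept only when it lowers the deviation, `psi_fit` can never exceed the deviation of the uniform starting guess; how closely `r_fit` approaches the true distribution depends on the number of steps and the move size.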

The critical step is itemQwhen the probabilities are mod- 
ified. This has been done either in a deterministic way 
similar to Eq. |j2U, or by proposing a Monte Carlo trial 
movement in the rank-ordered frequencies $r_i \to r_i + \Delta r_i$,
with $\Delta r_i$ being some random number, and subsequent
normalization. This change is accepted if
$\psi_{(r)}(r_i + \Delta r_i) \le \psi_{(r)}(r_i)$; otherwise it is
rejected. Both methods (i) and
(ii) yield very similar results. In this step the value of the
total number of species N also has to be modified: after t
trial movements of the frequencies we propose a trial number
of species $N \to N + \Delta N$, with $\Delta N$ being some random
number such that N stays smaller than the initial N
corresponding to a uniform distribution. This movement is
¹ Formally, Eq. (27) is not perfectly correct. Since the
indices $i$ in the distribution $\{r_i^*\}$ are not integers (see above),
$\psi_{(r)}$ has to be computed as the integral difference between two
functions, with $i$ being the integration variable. Since the
precise mathematical notation appears to be cumbersome without
contributing to deeper understanding, we leave Eq. (27) in its
present form.



accepted if $\psi_{(r)}(N + \Delta N) < \psi_{(r)}(N)$; otherwise it is
rejected. Alternatively, the optimization could be performed
for several values of N, as proposed in Sec. 4.2, in order
to find the minimum of the function $\psi_{(r)}(N)$.
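
The Monte Carlo variant of the optimization loop may be sketched as
follows. This is a minimal illustration under several assumptions that
are not taken from the article: the number of species N is kept fixed
(the trial moves in N are omitted), the trial movement is a Gaussian
change of one randomly chosen probability, the prediction step uses
multinomial sampling, and all names (`predict`, `psi`, `optimize`,
`step_size`) are hypothetical.

```python
import numpy as np

def predict(r, M, rng, repeats=10):
    """Predicted rank-ordered frequencies: rank-order each sample, then average."""
    acc = np.zeros(r.size)
    for _ in range(repeats):
        acc += np.sort(rng.multinomial(M, r))[::-1] / M
    return acc / repeats

def psi(r_exp, r_star):
    """Sum of absolute differences between observed and predicted frequencies."""
    return np.abs(r_exp - r_star).sum()

def optimize(r_exp, M, steps=2000, step_size=0.2, seed=0):
    """Accept a trial change of the probabilities only if the deviation does not grow."""
    rng = np.random.default_rng(seed)
    N = r_exp.size
    r = np.full(N, 1.0 / N)                  # initialization with r_i = 1/N
    best = psi(r_exp, predict(r, M, rng))
    history = [best]
    for _ in range(steps):
        trial = r.copy()
        i = rng.integers(N)
        # random change of one probability (abs keeps it non-negative)
        trial[i] = abs(trial[i] + rng.normal(scale=step_size / N))
        trial /= trial.sum()                 # subsequent normalization
        trial = np.sort(trial)[::-1].copy()  # keep the probabilities rank-ordered
        dev = psi(r_exp, predict(trial, M, rng))
        if dev <= best:                      # accept, otherwise reject
            r, best = trial, dev
        history.append(best)
    return r, history
```

Because the prediction in this sketch is itself obtained by sampling, the
deviation is noisy; increasing `repeats` reduces this noise at the cost of
run time.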

We wish to demonstrate the performance of this algorithm
by an example. Using the step-wise probability
distribution given by Eq. (18), we have drawn two samples
of sizes M = 2000 and M = 5000, respectively.
The corresponding rank-ordered frequencies $r_i^{\mathrm{exp}}$ are shown
in Fig. 8 (upper plot, solid lines). These values serve as
input (experimental data) to our algorithm, i.e., we apply
the algorithm to re-infer the true step-wise probability
distribution from these samples. Applying the algorithm
to these data we obtain the results shown in the
lower part of Fig. 8. Using the larger data set, M = 5000
(dashed line), we reproduced the original function (solid
line) to good accuracy. Given the significant deformation
of the measured frequencies shown in the upper
part of the figure, the quality of the result is surprising. Even
for M = 2000, where in the upper part of the figure (lower
dashed line) the three-step function can hardly be recognized,
the agreement between the numerical result of the
optimization procedure (lower plot, dot-dashed line) and the
original set of probabilities (solid line) is good.

[Figure: two panels, each plotted over the rank number $i$.]

Fig. 8. Top, solid lines: rank-ordered normalized frequencies
$r_i^{\mathrm{exp}}$ generated from the step-wise probability distribution
Eq. (18) (data scaled by M). The upper curve corresponds to the
sample size M = 5000, the lower one to M = 2000. These
curves serve as input to our algorithm. Top, dashed lines:
corresponding expected observed probabilities $r_i^*$ for sample sizes
M = 5000 and M = 2000, respectively, as generated from
the optimized set of probabilities. Bottom, solid line: original
probability distribution given by Eq. (18). Dashed curve:
rank-ordered probabilities as inferred by the described optimization
algorithm based on a sample of size M = 5000. Dot-dashed
curve: same but for M = 2000.

Figure 9 shows the deviation of the input frequency
distributions from the frequency distributions which have
been generated from the optimized probability distribution,
according to Eq. (27). After about 100,000 optimization
loops the result does not improve anymore. We expect
that at this level the accuracy of the approximation
$k_l^{\mathrm{exp}} \approx \langle k_l \rangle$ is reached.

[Figure: deviation plotted over the number of optimization steps.]

Fig. 9. Deviation of the input frequency distributions from
the frequency distributions which have been generated from
the optimized probability distribution, as defined by Eq. (27),
over the number of iteration cycles. The upper line shows the
deviation for M = 5000, the lower for M = 2000.

5 Discussion and perspectives 

In this article we propose a method to reconstruct the true
probability distribution of species from a set of frequency
distributions obtained from a small sample. This method
relies on the fact that the observed rank-ordered distribution
of probabilities $r_i^*$ for a finite sample of size M
can be predicted from the true rank-ordered distribution
$r_i$. Although we have not given a mathematically strict
proof of the theoretical expression for $r_i^*$, we have shown
that this method gives quantitatively good results for the
cases studied.

As mentioned, the rank-ordered distribution contains 
less information than the species-ordered distribution: the 
identity of the species is lost. Nevertheless, many statis- 
tical quantities are invariant with respect to the species 
labeling. Any quantity defined as the sum over all species 
of a function of their probability is insensitive to the order 
of the species. The Shannon entropy, for example, can be 
written as 

$$H = -\sum_{i,\ \mathrm{all\ species}} p_i \log p_i = -\sum_{i,\ \mathrm{all\ ranks}} r_i \log r_i . \tag{28}$$
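
The invariance of the Shannon entropy under rank-ordering can be checked
numerically. The following lines are only an illustration; the example
probabilities and the function name `shannon_entropy` are assumptions
made for this sketch.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (natural logarithm); terms with p = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.10, 0.50, 0.15, 0.25])   # species-ordered probabilities (made-up values)
r = np.sort(p)[::-1]                     # the same probabilities, rank-ordered

H_species = shannon_entropy(p)
H_ranks = shannon_entropy(r)             # agrees with H_species up to rounding
```

The same holds for any other quantity that is a sum over all species of a
function of the probability alone.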



Equation (20) allows for the prediction of the value of
the observed entropy, defined by

$$H^* = -\sum_{i=1}^{N'} r_i^* \log r_i^* = -\sum_{l=1}^{M} k_l \frac{l}{M} \log \frac{l}{M} . \tag{29}$$

This quantity is experimentally accessible and serves in
practical applications as a measure for deviations from the
equidistribution. Since even for a perfect equidistribution
of the species $H^*$ deviates from the entropy $H = \log N$ due
to finite size effects, as shown in Sec. 2, empirical entropies
which are based on different sample sizes M cannot be
compared directly with each other. The method proposed
here enables us to subdivide the deviations $H^* - H$ into
a part due to the finite sample size M,

$$-\sum_{l=1}^{M} \langle k_l \rangle \frac{l}{M} \log \frac{l}{M} , \tag{30}$$

where the $\langle k_l \rangle$ are given by Eq. and a part which
is related to the true deviations from the equiprobability
distribution. This way we can also compare distributions
which are based on different sample sizes. In the same way
we can also quantify sample-size independent deviations
from any other distribution if we compute the expected
cluster sizes in the expression (30) according to Eq. (17).
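
The finite size effect on the observed entropy of an equidistribution can
be made visible with a small simulation. This sketch does not use the
analytical expressions of this article; it simply averages the observed
entropy over many simulated samples. The function names and the chosen
values of N, M and the number of trials are assumptions.

```python
import numpy as np

def observed_entropy(counts):
    """Entropy computed from observed relative frequencies M_i / M."""
    f = counts[counts > 0] / counts.sum()
    return -np.sum(f * np.log(f))

def mean_observed_entropy(N, M, trials=200, seed=0):
    """Average observed entropy for N equiprobable species and sample size M."""
    rng = np.random.default_rng(seed)
    p = np.full(N, 1.0 / N)
    values = [observed_entropy(rng.multinomial(M, p)) for _ in range(trials)]
    return float(np.mean(values))

H_true = np.log(100)                        # H = log N for N = 100
H_small = mean_observed_entropy(100, 50)    # strongly undersampled
H_large = mean_observed_entropy(100, 5000)  # well sampled
```

Both averages stay below log N, and the smaller sample gives the lower
value, so two empirical entropies obtained with different M differ even
though the underlying distribution is the same.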

Another question which can be solved using the methods
developed here is the evaluation of mean values of
fluctuating quantities

$$\langle A \rangle = \sum_{i=1}^{N} A_i p_i . \tag{31}$$

This quantity may be determined by the following procedure:
We first introduce the set $a_i = A_i p_i$ and the
corresponding observed set $a_i^*$.

The set of the rank-ordered numbers $a_i$ and, therefore,
$\langle A \rangle$ may be determined from the observed quantities $a_i^*$
in precisely the same way as shown for the probabilities in
this paper. We believe that this new method to estimate
mean values may have many interesting applications.

The described correction algorithms have been applied
to some biologically relevant examples such as the spatial
distribution of point mutations in genes [14].